{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vvANigl6Hrcd"
      },
      "source": [
        "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/LanguageToolkit/blob/main/langkit/examples/Sentiment_and_Toxicity.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "frxW2u3RJk1D"
      },
      "source": [
        "## Install LangKit\n",
        "\n",
        "First let's install __LangKit__."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "f5IsgjyuKGWJ"
      },
      "outputs": [],
      "source": [
        "%pip install langkit[all]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TAeYfAYnHp-u"
      },
      "source": [
        "# Tracking Sentiment and Toxicity Scores in Text with Langkit\n",
        "\n",
        "In this example, we'll show how you can easily track sentiment and toxicity scores in text with Langkit.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Px1lNOkHHp-w"
      },
      "source": [
        "As an example, we'll use the [tweet_eval dataset](https://huggingface.co/datasets/tweet_eval). We'll use the `hateful` subset of the dataset, which contains tweets labeled as hateful or not hateful."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "H0YNDLPOHp-w"
      },
      "outputs": [],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "hateful_comments = load_dataset('tweet_eval','hate',split=\"train\", streaming=True)\n",
        "comments = iter(hateful_comments)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "beyHBap8Hp-x"
      },
      "source": [
        "## Initializing the Metrics\n",
        "\n",
        "To initialize the `toxicity` and `sentiment` metrics, we simply import the respective modules from `langkit`. This will automatically register the metrics, so we can start using them right away by creating a schema by calling `generate_udf_schema`. We will pass that schema to whylogs, so that it knows which metrics to track."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "VRHZrfT2Hp-y"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "[nltk_data] Downloading package vader_lexicon to\n",
            "[nltk_data]     /home/jamie/nltk_data...\n",
            "[nltk_data]   Package vader_lexicon is already up-to-date!\n"
          ]
        }
      ],
      "source": [
        "from whylogs.experimental.core.udf_schema import udf_schema\n",
        "from langkit import toxicity, sentiment\n",
        "\n",
        "text_schema = udf_schema()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X_68Wn0lHp-y"
      },
      "source": [
        "## Profiling the Data\n",
        "\n",
        "Now we're set to log our data.\n",
        "\n",
        "To make sure the metrics make sense, we will profile two separate groups of data:\n",
        "- hateful comments: comments that are labeled as hateful\n",
        "- non-hateful comments: comments that are labeled as non-hateful\n",
        "\n",
        "We can expect hateful comments to have a higher toxicity score and a lower sentiment score than non-hateful comments.\n",
        "\n",
        "Let's see if our metrics will reflect that."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "9_nLmLE1Hp-z"
      },
      "outputs": [],
      "source": [
        "import whylogs as why\n",
        "\n",
        "# Just initializing the profiles with generic comments.\n",
        "non_hateful_profile = why.log({\"prompt\":\"I love flowers.\"}, schema=text_schema).profile()\n",
        "hateful_profile = why.log({\"prompt\":\"I hate biscuits.\"}, schema=text_schema).profile()\n",
        "\n",
        "for _ in range(200):\n",
        "  comment = next(comments)\n",
        "  if comment['label'] == 0:\n",
        "    non_hateful_profile.track({\"prompt\":comment['text']})\n",
        "  else:\n",
        "    hateful_profile.track({\"prompt\":comment['text']})\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2rhVn7ToHp-z"
      },
      "source": [
        "Now that we have our profiles, let's check out the metrics. Let's compare the mean for our sentiment and toxicity scores, for each group (hateful and non-hateful):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "43mdtqyIHp-0",
        "outputId": "99bc1546-8313-4517-e2a4-aa3241a07226"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "######### Sentiment #########\n",
            "The average sentiment score for the hateful comments is -0.3437611764705882\n",
            "The average sentiment score for the non-hateful comments is -0.04682222222222223\n",
            "######### Toxicity #########\n",
            "The average toxicity score for the hateful comments is 0.345255050238441\n",
            "The average toxicity score for the non-hateful comments is 0.1513899951918511\n"
          ]
        }
      ],
      "source": [
        "hateful_sentiment = hateful_profile.view().get_column(\"prompt.sentiment_nltk\").to_summary_dict()[\"distribution/mean\"]\n",
        "non_hateful_sentiment = non_hateful_profile.view().get_column(\"prompt.sentiment_nltk\").to_summary_dict()[\"distribution/mean\"]\n",
        "\n",
        "hateful_toxicity = hateful_profile.view().get_column(\"prompt.toxicity\").to_summary_dict()[\"distribution/mean\"]\n",
        "non_hateful_toxicity = non_hateful_profile.view().get_column(\"prompt.toxicity\").to_summary_dict()[\"distribution/mean\"]\n",
        "\n",
        "print(\"######### Sentiment #########\")\n",
        "print(f\"The average sentiment score for the hateful comments is {hateful_sentiment}\")\n",
        "print(f\"The average sentiment score for the non-hateful comments is {non_hateful_sentiment}\")\n",
        "\n",
        "print(\"######### Toxicity #########\")\n",
        "print(f\"The average toxicity score for the hateful comments is {hateful_toxicity}\")\n",
        "print(f\"The average toxicity score for the non-hateful comments is {non_hateful_toxicity}\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can also have more detailed information by calling the profile view's to_pandas method: "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>cardinality/est</th>\n",
              "      <th>cardinality/lower_1</th>\n",
              "      <th>cardinality/upper_1</th>\n",
              "      <th>counts/inf</th>\n",
              "      <th>counts/n</th>\n",
              "      <th>counts/nan</th>\n",
              "      <th>counts/null</th>\n",
              "      <th>distribution/max</th>\n",
              "      <th>distribution/mean</th>\n",
              "      <th>distribution/median</th>\n",
              "      <th>...</th>\n",
              "      <th>distribution/q_95</th>\n",
              "      <th>distribution/q_99</th>\n",
              "      <th>distribution/stddev</th>\n",
              "      <th>type</th>\n",
              "      <th>types/boolean</th>\n",
              "      <th>types/fractional</th>\n",
              "      <th>types/integral</th>\n",
              "      <th>types/object</th>\n",
              "      <th>types/string</th>\n",
              "      <th>types/tensor</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>prompt</th>\n",
              "      <td>85.000018</td>\n",
              "      <td>85.0</td>\n",
              "      <td>85.004262</td>\n",
              "      <td>0</td>\n",
              "      <td>85</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>NaN</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>NaN</td>\n",
              "      <td>...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>85</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>prompt.sentiment_nltk</th>\n",
              "      <td>65.000010</td>\n",
              "      <td>65.0</td>\n",
              "      <td>65.003256</td>\n",
              "      <td>0</td>\n",
              "      <td>85</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0.863400</td>\n",
              "      <td>-0.343761</td>\n",
              "      <td>-0.504000</td>\n",
              "      <td>...</td>\n",
              "      <td>0.476700</td>\n",
              "      <td>0.863400</td>\n",
              "      <td>0.487009</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>85</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>prompt.toxicity</th>\n",
              "      <td>85.000018</td>\n",
              "      <td>85.0</td>\n",
              "      <td>85.004262</td>\n",
              "      <td>0</td>\n",
              "      <td>85</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0.962283</td>\n",
              "      <td>0.345255</td>\n",
              "      <td>0.056012</td>\n",
              "      <td>...</td>\n",
              "      <td>0.952058</td>\n",
              "      <td>0.962283</td>\n",
              "      <td>0.418585</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>85</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>3 rows × 28 columns</p>\n",
              "</div>"
            ],
            "text/plain": [
              "                       cardinality/est  cardinality/lower_1  \\\n",
              "column                                                        \n",
              "prompt                       85.000018                 85.0   \n",
              "prompt.sentiment_nltk        65.000010                 65.0   \n",
              "prompt.toxicity              85.000018                 85.0   \n",
              "\n",
              "                       cardinality/upper_1  counts/inf  counts/n  counts/nan  \\\n",
              "column                                                                         \n",
              "prompt                           85.004262           0        85           0   \n",
              "prompt.sentiment_nltk            65.003256           0        85           0   \n",
              "prompt.toxicity                  85.004262           0        85           0   \n",
              "\n",
              "                       counts/null  distribution/max  distribution/mean  \\\n",
              "column                                                                    \n",
              "prompt                           0               NaN           0.000000   \n",
              "prompt.sentiment_nltk            0          0.863400          -0.343761   \n",
              "prompt.toxicity                  0          0.962283           0.345255   \n",
              "\n",
              "                       distribution/median  ...  distribution/q_95  \\\n",
              "column                                      ...                      \n",
              "prompt                                 NaN  ...                NaN   \n",
              "prompt.sentiment_nltk            -0.504000  ...           0.476700   \n",
              "prompt.toxicity                   0.056012  ...           0.952058   \n",
              "\n",
              "                       distribution/q_99  distribution/stddev  \\\n",
              "column                                                          \n",
              "prompt                               NaN             0.000000   \n",
              "prompt.sentiment_nltk           0.863400             0.487009   \n",
              "prompt.toxicity                 0.962283             0.418585   \n",
              "\n",
              "                                     type  types/boolean  types/fractional  \\\n",
              "column                                                                       \n",
              "prompt                 SummaryType.COLUMN              0                 0   \n",
              "prompt.sentiment_nltk  SummaryType.COLUMN              0                85   \n",
              "prompt.toxicity        SummaryType.COLUMN              0                85   \n",
              "\n",
              "                       types/integral  types/object  types/string  \\\n",
              "column                                                              \n",
              "prompt                              0             0            85   \n",
              "prompt.sentiment_nltk               0             0             0   \n",
              "prompt.toxicity                     0             0             0   \n",
              "\n",
              "                       types/tensor  \n",
              "column                               \n",
              "prompt                            0  \n",
              "prompt.sentiment_nltk             0  \n",
              "prompt.toxicity                   0  \n",
              "\n",
              "[3 rows x 28 columns]"
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "hateful_profile.view().to_pandas()\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "langkit-vSL8Mxyz-py3.8",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.10"
    },
    "orig_nbformat": 4
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
